1 Introduction

This document is an online supplement to

Evert, Stephanie (2022). Measuring Keyness. In: Proceedings of the Digital Humanities 2022 Conference, online/Tokyo, Japan.

It provides a reference implementation of the LRC keyness measure proposed in this paper, as well as replication code for the visualisation and analysis.

A user-friendly implementation of LRC will be made available as part of the keyness() function in the R package corpora version 0.6 (to be released by August 2022). [Update: now scheduled for July 2023]

2 Keyness measures

2.1 Corpus data & sampling model

Keyness measures compare the relative frequency \(\pi_1\) (= occurrence probability) of a candidate term \(w\) in a linguistic population A with its frequency \(\pi_2\) in a population B, based on independent random samples (= corpora) A and B. This computation is repeated for each candidate term \(w\in C\), then candidates are ranked according to their keyness scores and cutoff thresholds may be applied.

For each candidate term, the observed data consist of

  • \(f_1\) = frequency of \(w\) in sample A
  • \(n_1\) = sample size of A (total number of tokens)
  • \(f_2\) = frequency of \(w\) in sample B
  • \(n_2\) = sample size of B (total number of tokens)

Recent studies (Evert et al. 2018, Egbert & Biber 2019) suggest it is more robust to use document frequencies instead of term frequencies; in this case, \(n_1\) and \(n_2\) are the numbers of documents in the two samples.

NB: It is quite common to have some candidates with \(f_2 = 0\), i.e. terms that only occur in sample A, but not in sample B. In a symmetric keyword analysis (where A and B have equal status), the case \(f_1 = 0\) is also possible.

The observed candidate data can be collected in the form of a \(2\times 2\) contingency table, which represents a cross-classification of items and is the basis of many statistical techniques. The first column corresponds to sample A, the second column to sample B; the first row lists the frequency of the candidate term \(w\), the second row the total frequency of all other terms \(w'\).

\[ \begin{array}{|c|c|} \hline O_{11} = f_1 & O_{12} = f_2 \\ \hline O_{21} = n_1 - f_1 & O_{22} = n_2 - f_2 \\ \hline \end{array} \]
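For concreteness, the observed table for a single candidate can be assembled in R; the counts below are purely illustrative:

```r
f1 <- 30; f2 <- 5      # hypothetical frequencies of w in samples A and B
n1 <- 1e4; n2 <- 1e4   # hypothetical sample sizes
O <- matrix(c(f1, n1 - f1, f2, n2 - f2), nrow=2,
            dimnames=list(c("w", "other"), c("A", "B")))
O  # column sums recover the sample sizes n1 and n2
```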

The columns of this contingency table form two independent binomial distributions with

\[ \mathbb{P}(f_1) = \binom{n_1}{f_1} \pi_1^{f_1} (1 - \pi_1)^{n_1 - f_1} \qquad \mathbb{P}(f_2) = \binom{n_2}{f_2} \pi_2^{f_2} (1 - \pi_2)^{n_2 - f_2} \]

that can be approximated very well by Poisson distributions

\[ \mathbb{P}(f_1) = \frac{e^{-\lambda_1}\cdot \lambda_1^{f_1}}{f_1 !}, \;\; \lambda_1 = n_1 \pi_1 \qquad \mathbb{P}(f_2) = \frac{e^{-\lambda_2}\cdot\lambda_2^{f_2}}{f_2 !}, \;\; \lambda_2 = n_2 \pi_2 \]

for the typical use case of lexical frequencies (\(\pi_1, \pi_2 \ll 1\)).
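The quality of this approximation is easy to check numerically; the sketch below uses an invented occurrence probability of \(\pi_1 = 10^{-4}\) in a sample of \(n_1 = 100{,}000\) tokens (so \(\lambda_1 = 10\)):

```r
n1 <- 1e5; pi1 <- 1e-4      # hypothetical sample size and occurrence probability
f <- 0:30                   # range of plausible frequencies
p.binom <- dbinom(f, size=n1, prob=pi1)
p.pois  <- dpois(f, lambda=n1 * pi1)
max(abs(p.binom - p.pois))  # discrepancy is negligible for pi1 << 1
```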

2.2 Significance measures

Perhaps the most widely used keyness measures are based on statistical hypothesis tests and assess the statistical significance of each candidate term \(w\), i.e. the amount of evidence against the null hypothesis

\[ H_0: \pi_1 = \pi_2 \]

which states that there is no frequency difference between the two populations. We can compute the expected values of the contingency table under \(H_0\) (cf. Evert 2008: 1251f) as

\[ \begin{array}{|c|c|} \hline E_{11} = \frac{n_1 (f_1 + f_2)}{n_1 + n_2} & E_{12} = \frac{n_2 (f_1 + f_2)}{n_1 + n_2} \\ \hline E_{21} = n_1 - E_{11} & E_{22} = n_2 - E_{12} \\ \hline \end{array} \]

Most statistical tests for contingency tables are based on a comparison of the observed (\(O_{ij}\)) and expected (\(E_{ij}\)) contingency tables. Popular significance keyness measures are the chi-squared test (Scott 1997: 236)

\[ X^2 = \sum_{i=1}^2 \sum_{j=1}^2 \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

and in particular the log-likelihood measure (Rayson & Garside 2000: 3)

\[ G^2 = 2 \sum_{i=1}^2 \sum_{j=1}^2 O_{ij} \log \frac{O_{ij}}{E_{ij}} \]

The resulting scores indicate the amount of evidence against \(H_0\) provided by the two samples, i.e. higher values indicate stronger keyness. They can be translated into p-values, or alternatively cutoff thresholds corresponding to customary significance levels can be specified. It is common in keyword analysis to work with the raw keyness scores, though.
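As a sanity check on these formulas, \(X^2\) computed from the observed and expected tables agrees with R's built-in chisq.test() once the continuity correction is disabled; the toy counts below are invented:

```r
O <- matrix(c(30, 9970, 5, 9995), nrow=2)    # hypothetical 2x2 contingency table
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected values under H0
X2 <- sum((O - E)^2 / E)
all.equal(X2, unname(chisq.test(O, correct=FALSE)$statistic))
pchisq(X2, df=1, lower.tail=FALSE)           # corresponding asymptotic p-value
```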

Both \(G^2\) and \(X^2\) are two-sided tests, i.e. they do not distinguish between the alternative hypotheses \(\pi_1 > \pi_2\) (which we’re interested in) and \(\pi_1 < \pi_2\) (sometimes referred to as negative keywords). For this reason, I prefer to assign a negative sign to \(G^2\) and \(X^2\) scores if \(f_1/n_1 < f_2/n_2\) (or equivalently \(O_{11} < E_{11}\)). NB: many other software packages do not make this adjustment, but may provide separate filters for positive and negative keywords.

Here I provide a reference implementation of log-likelihood \(G^2\) with negative scores for negative keywords. The function G2() must be called with arguments \(f_1, f_2, n_1, n_2\), where \(f_1, f_2\) are vectors of the same length (with corresponding entries for each candidate term \(w\)) and \(n_1, n_2\) are usually scalar values specifying the fixed sample sizes of A and B.

G2.term <- function (O, E) {
  res <- O * log(O / E)  # single term O * log(O / E) of the statistic
  res[O == 0] <- 0       # convention: 0 * log(0) = 0 (avoids NaN)
  res
}
G2 <- function (f1, f2, N1, N2, alpha=NULL, correct=TRUE) {
  stopifnot(length(f1) == length(f2))
  
  ## observed and expected contingency tables
  N <- N1 + N2
  R1 <- f1 + f2
  O11 <- f1;      E11 <- R1 * N1 / N
  O12 <- f2;      E12 <- R1 * N2 / N
  O21 <- N1 - f1; E21 <- N1 - E11
  O22 <- N2 - f2; E22 <- N2 - E12
  
  ## log-likelihood statistic (simplest formula)
  G2 <- 2 * (G2.term(O11, E11) + G2.term(O12, E12) + G2.term(O21, E21) + G2.term(O22, E22))
  res <- sign(O11 - E11) * G2 # set sign to distinguish positive vs. negative keywords
  
  ## weed out non-significant items if alpha is specified
  if (!is.null(alpha)) {
    if (correct) alpha <- alpha / length(f1)
    theta <- qchisq(alpha, df=1, lower.tail=FALSE)
    res[G2 < theta] <- 0 # set to 0 if not significant at level alpha
  }
  
  res
}

2.3 Significance thresholds

Significance measures can also be used to weed out candidates that are not statistically significant. Appropriate thresholds for a given significance level \(\alpha\) are obtained from the asymptotic \(\chi^2_1\) distribution of \(X^2\) and \(G^2\) under \(H_0\). If a significance level is specified with the alpha= argument, G2() will return a keyness score of 0 for any candidates that are not significant at level \(\alpha\).

Hardie (2014: 51-52) argues that such significance filters need to take multiple testing into account: an independent significance test is carried out for each candidate, and each test can generate a false positive with probability \(\alpha\). Even a conservative significance level such as \(\alpha = .001\) might result in hundreds or thousands of false positives when applied to hundreds of thousands of candidates extracted from a large corpus. It is therefore advisable to control the family-wise error rate (FWER), i.e. the probability of generating at least one false positive in the entire set of \(m\) tests. Hardie argues for a Šidák correction, which reduces the significance level for individual tests to

\[ \alpha' = 1 - (1 - \alpha)^{\frac1m} \]
while I prefer the more conservative, but simpler and numerically more robust Bonferroni correction

\[ \alpha' = \frac{\alpha}{m} \]

G2() applies a Bonferroni correction by default if alpha= is specified, based on the number of candidates passed to the function, i.e. \(m =\) length(f1). The adjustment can be disabled by passing correct=FALSE.
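A quick numeric comparison shows how little the two corrections differ in practice; the candidate count \(m\) below is an arbitrary example:

```r
alpha <- 0.001; m <- 1e5               # hypothetical number of candidate terms
sidak <- 1 - (1 - alpha)^(1 / m)       # Šidák-corrected level
bonferroni <- alpha / m                # Bonferroni-corrected level
c(sidak=sidak, bonferroni=bonferroni)  # nearly identical; Bonferroni slightly smaller
```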

2.4 Effect-size measures

Significance measures have repeatedly been criticised in corpus linguistics because they focus entirely on the amount of evidence against \(H_0\) rather than on how big the difference between \(\pi_1\) and \(\pi_2\) is. As a result, high-frequency candidate terms may obtain high keyness scores even if the difference in relative frequency between A and B is quite small, making them irrelevant from a linguistic perspective.

Several researchers have thus recommended the use of effect-size measures in keyword analysis. The simplest and most intuitive measure is relative risk, the ratio between the relative frequencies in the two samples:

\[ r = \frac{f_1}{n_1} / \frac{f_2}{n_2} \]

Hardie (2014: 45) proposes LogRatio \(= \log_2 r\) (i.e. logarithmic relative risk) as an intuitive keyness measure. The logarithm transforms scores into a more conveniently readable range and ensures that positive and negative keywords can be distinguished by the sign of the score.

Most other effect size measures that have been suggested in the context of keyword analysis are either equivalent to LogRatio – such as %DIFF \(= 100\cdot (r - 1)\) (Gabrielatos & Marchi 2012) – or closely related – such as odds ratio (Pojanapunya & Watson Todd 2018)

\[ \hat{\theta} = \frac{f_1}{n_1 - f_1} / \frac{f_2}{n_2 - f_2} \]

and \(\mathbf{\Delta P} = (r - 1) \frac{f_2}{n_2}\) (Gries 2013: 143f).

One problematic aspect of LogRatio and most related effect-size measures is that they are not defined for the case \(f_2 = 0\). Hardie (2014: 45) recommends setting \(f_2 = \frac12\) in this case (and \(f_1 = \frac12\) vice versa). It is difficult to justify this particular value, though, and other researchers have proposed much smaller values, e.g. setting \(\frac{f_2}{n_2} = 10^{-18}\) (Gabrielatos & Marchi 2012).

[Update] A better solution is to use Walter’s (1975) adjusted estimator \[ \text{LogRatio} = \log_2 \frac{f_1 + \frac12}{n_1 + \frac12} - \log_2 \frac{f_2 + \frac12}{n_2 + \frac12} \] which is well-defined for \(f_1 = 0\) or \(f_2 = 0\) and provides a nearly unbiased estimate of \(\log_2 r\) (Gart & Nam 1988: 324).
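A minimal illustration with invented counts: for \(f_2 = 0\) the plain point estimate degenerates, while Walter's adjusted estimator remains finite:

```r
f1 <- 10; f2 <- 0; n1 <- 1000; n2 <- 1000  # hypothetical counts with f2 = 0
log2((f1 / n1) / (f2 / n2))                # plain estimate degenerates to +Inf
log2((f1 + 0.5) / (n1 + 0.5)) - log2((f2 + 0.5) / (n2 + 0.5))  # Walter: finite
```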

[Update] My reference implementation of LogRatio defaults to Walter’s adjustment but also provides Hardie’s version, which is implemented in the CQPweb corpus platform (Hardie 2012). The LogRatio() function takes the same arguments as G2() and can optionally be combined with a \(G^2\)-based significance filter, setting keyness scores to 0 for any candidates that are not statistically significant.

LogRatio <- function (f1, f2, N1, N2, adjust=c("walter", "hardie"), alpha=NULL, correct=TRUE) {
  stopifnot(length(f1) == length(f2))
  stopifnot(all(f1 >= 1))
  adjust <- match.arg(adjust)

  ## point estimate of LogRatio = log2(relative risk)
  if (adjust == "hardie") {
    f2.disc <- pmax(f2, 0.5) # questionable discounting of f2=0 according to Hardie (2014)
    res <- log2((f1 / N1) / (f2.disc / N2))
  } else {
    res <- log2((f1 + 0.5) / (N1 + 0.5)) - log2((f2 + 0.5) / (N2 + 0.5))
  }
    
  ## weed out non-significant items if alpha is specified
  if (!is.null(alpha)) {
    G2.scores <- G2(f1, f2, N1, N2, alpha=alpha, correct=correct)
    res[G2.scores == 0] <- 0 # set LogRatio to zero if difference is not significant according to G2
  }
  
  res
}

2.5 Heuristic measures

Pure effect-size measures such as those listed above have an important drawback: they ignore statistical uncertainty in the form of sampling variation and are prone to yielding spuriously high scores for low-frequency candidate terms. The resulting low-frequency bias of effect-size measures can only partially be mitigated by adding a significance filter or a frequency threshold such as \(f_1 > k\).

Various heuristic variations of effect-size measures have been proposed to control the low-frequency bias more effectively. The most widely used example is SimpleMaths (Kilgarriff 2009), implemented in the SketchEngine platform:

\[ \mathop{\text{SM}} = \frac{10^6\cdot \frac{f_1}{n_1} + \lambda}{10^6\cdot \frac{f_2}{n_2} + \lambda} \]

where \(\lambda > 0\) is a user-specified parameter that adjusts the frequency bias of SM (with large values resulting in a bias towards high-frequency candidates). The scaling factor of \(10^6\) was chosen so that \(\lambda = 1\) is a practical default value.

My reference implementation SimpleMaths() takes the additional argument lambda=. Following SketchEngine, it does not offer a combination with a significance filter. As an extension, log=TRUE returns logarithmic values with a more reasonable scaling and sign corresponding (roughly) to positive vs. negative keywords.

SimpleMaths <- function (f1, f2, N1, N2, lambda=1, log=FALSE) {
  p1 <- f1 / N1
  p2 <- f2 / N2
  res <- (1e6 * p1 + lambda) / (1e6 * p2 + lambda)
  if (log) log2(res) else res
}
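The role of \(\lambda\) can be illustrated by plugging per-million frequencies directly into the SM formula; the two hypothetical candidates below are a rare term exclusive to A and a frequent but less distinctive term:

```r
p1 <- c(rare=5, frequent=1000)  # invented per-million frequencies in sample A
p2 <- c(rare=0, frequent=200)   # ... and in sample B
sm <- function (lambda) (p1 + lambda) / (p2 + lambda)
sm(1)    # small lambda favours the rare exclusive term
sm(100)  # large lambda shifts the ranking towards the frequent term
```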

2.6 LRC = Conservative LogRatio

Heuristic corrections such as SimpleMaths are unsatisfactory because they lack any mathematical justification, even if they might work reasonably well in practice. Due to the wide range of different applications of keyness measures and the paucity of thorough empirical evaluation studies, it seems particularly important to use a keyness measure with well-understood mathematical properties.

The mathematical reason behind the unreliability of effect-size measures is that they do not directly measure the quantity of interest, e.g. the relative risk or odds ratio (etc.) between populations A and B:

\[ \rho = \frac{\pi_1}{\pi_2} \qquad \theta = \frac{\pi_1}{1 - \pi_1} / \frac{\pi_2}{1 - \pi_2} \]

but rather estimate the quantity from random samples A and B. Such point estimates fail to take sampling variation into account, resulting in a high degree of uncertainty for low-frequency data.

A better approach is to determine confidence intervals of plausible values (e.g. \([\rho_{-}, \rho_{+}]\)) from the observed contingency table. These are obtained by inverting statistical hypothesis tests, collecting all values \(x\) for which the null hypothesis (\(H_0: \rho = x\)) cannot be rejected at a given significance level \(\alpha\). A conservative effect-size measure uses the boundary of the confidence interval that is closer to the null hypothesis (\(H_0: \rho = 1\)) as a keyness score. If \(\rho_{-} \leq 1 \leq \rho_{+}\), the candidate is not significant and should be filtered out.

I propose LRC (conservative LogRatio) as a best-practice keyness measure. LRC determines a conservative estimate for \(\log_2 \rho\) at a family-wise level, i.e. using a Bonferroni-corrected significance level \(\alpha'\).

LRC is based on an exact conditional Poisson test (Fay 2010: 55) and confidence interval, using the approximated Poisson sampling distribution of the observed contingency table with rates \(\lambda_1 = n_1 \pi_1\) and \(\lambda_2 = n_2 \pi_2\). Conditioning on the row sum \(f_1 + f_2\) results in a binomial distribution (Lehmann & Romano 2005: 125)

\[ \mathbb{P}(f_1 | f_1 + f_2) = \binom{f_1 + f_2}{f_1} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{f_1} \left( 1 - \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{f_2} \]

An exact Clopper-Pearson confidence interval for the binomial parameter \(\tau = \frac{\lambda_1}{\lambda_1 + \lambda_2}\) can be obtained efficiently via (the inverse of) the incomplete Beta function. This corresponds to the “central” method recommended by Fay (2010: 53). The code below – copied from the corpora package – is equivalent to the implementation in the built-in binom.test() function.

binom.confint <- function(k, n, conf.level=0.95, correct=FALSE,
                          alternative=c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  stopifnot(all(k >= 0) && all(k <= n) && all(n >= 1))
  stopifnot(all(conf.level >= 0) && all(conf.level <= 1))

  ## significance level for underlying hypothesis test (with optional Bonferroni correction)
  alpha <- if (alternative == "two.sided") (1 - conf.level) / 2 else (1 - conf.level)
  if (correct) alpha <- alpha / length(k) # Bonferroni correction
  alpha <- rep_len(alpha, length(k))      # needs to be vector for safe.qbeta() 
  
  ## Clopper-Pearson method: invert binomial test (using incomplete Beta function)
  lower <- safe.qbeta(alpha, k, n - k + 1)
  upper <- safe.qbeta(alpha, k + 1, n - k, lower.tail=FALSE)
  switch(alternative,
         two.sided = data.frame(lower = lower, upper = upper),
         less      = data.frame(lower = 0,     upper = upper),
         greater   = data.frame(lower = lower, upper = 1))
}

## safely compute qbeta even for shape parameters alpha == 0 or beta == 0
safe.qbeta <- function (p, shape1, shape2, lower.tail=TRUE) {
  stopifnot(length(p) == length(shape1) && length(p) == length(shape2))
  is.0 <- shape1 <= 0
  is.1 <- shape2 <= 0
  ok <- !(is.0 | is.1)
  x <- numeric(length(p))
  x[ok] <- qbeta(p[ok], shape1[ok], shape2[ok], lower.tail=lower.tail) # shape parameters are valid
  x[is.0 & !is.1] <- 0 # density concentrated at x = 0 (for alpha == 0)
  x[is.1 & !is.0] <- 1 # density concentrated at x = 1 (for beta == 0)
  x[is.0 & is.1] <- NA # shouldn't happen in our case (alpha == beta == 0)
  x
}
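The claimed equivalence with the built-in binom.test() can be spot-checked directly via the incomplete Beta function; the counts below are arbitrary example values:

```r
k <- 7; n <- 20; conf.level <- 0.95  # arbitrary example counts
alpha <- (1 - conf.level) / 2        # two-sided significance level per tail
lower <- qbeta(alpha, k, n - k + 1)
upper <- qbeta(alpha, k + 1, n - k, lower.tail=FALSE)
c(lower, upper)                      # matches binom.test(k, n)$conf.int
```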

The confidence interval \([\tau_{-}, \tau_{+}]\) is then transformed into a confidence interval \([\log_2 \rho_{-}, \log_2 \rho_{+}]\) using the equality

\[ \rho = \frac{n_2 \tau}{n_1 (1 - \tau)} \]

If this interval contains 0, the null hypothesis \(H_0: \rho = 1\) cannot be rejected and the score of the non-significant candidate is set to 0. Otherwise the end of the interval closer to 0 is returned as a keyness score. Positive values thus indicate significant positive keywords and negative values indicate significant negative keywords.
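The identity behind this transformation (\(\tau / (1 - \tau) = \lambda_1 / \lambda_2\)) can be verified numerically with made-up rates:

```r
n1 <- 1000; n2 <- 2000; pi1 <- 0.01; pi2 <- 0.004  # hypothetical populations
lambda1 <- n1 * pi1; lambda2 <- n2 * pi2           # Poisson rates
tau <- lambda1 / (lambda1 + lambda2)               # conditional binomial parameter
(n2 * tau) / (n1 * (1 - tau))                      # recovers rho = pi1 / pi2
```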

My reference implementation LRC() below takes the same arguments as G2() as well as an additional confidence level \(1 - \alpha\). By default (correct=TRUE) a Bonferroni correction is applied to obtain the significance level \(\alpha' = \alpha / m\) for the confidence intervals of individual candidates. An additional benefit of LRC is that it also works for \(f_1 = 0\) or \(f_2 = 0\).

LRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE) {
  stopifnot(length(f1) == length(f2))
  stopifnot(all(f1 + f2 >= 1))

  ## exact confidence interval from conditional Poisson test (two-sided)
  tau <- binom.confint(f1, f1 + f2, conf.level=conf.level, correct=correct, alternative="two.sided")
  ifelse(f1 / N1 >= f2 / N2, 
         pmax(log2( (N2 / N1) * tau$lower / (1 - tau$lower) ), 0),  # p1 >= p2 -> use lower bound (clamped to >= 0)
         pmin(log2( (N2 / N1) * tau$upper / (1 - tau$upper) ), 0))  # p1 < p2  -> use upper bound (clamped to <= 0)
}

2.7 Variants of LRC

NB: These variants of LRC are not recommended as best-practice measures and included here only for completeness and for comparison in the experiments below.

Positive LRC computes a one-sided confidence interval and always returns its lower bound \(\log_2 \rho_{-}\). It thus identifies only positive keywords (\(\pi_1 > \pi_2\)); candidates with negative scores are not significant (but can still be included in a ranking and possibly considered in a linguistic interpretation). However, one cannot determine from these scores whether a term occurs more frequently in sample A or in sample B.

PositiveLRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE) {
  stopifnot(length(f1) == length(f2))
  stopifnot(all(f1 + f2 >= 1))

  ## exact confidence interval from conditional Poisson test (one-sided)
  tau <- binom.confint(f1, f1 + f2, conf.level=conf.level, correct=correct, alternative="greater")
  log2( (N2 / N1) * tau$lower / (1 - tau$lower) )
}

The version of LRC currently implemented in CQPweb is based on asymptotic confidence intervals from a normal approximation rather than the exact conditional Poisson test. It has serious numerical and conceptual problems, many of which are inherited from Hardie’s adjustment \(f_2 = \frac12\) in the case of \(f_2 = 0\). [Update] These problems are alleviated somewhat if Walter’s (1975) point estimate and a similar nearly unbiased estimate \[ \hat{\sigma}(\log r) = \sqrt{ \frac{1}{f_1 + \frac12} + \frac{1}{f_2 + \frac12} - \frac{1}{n_1 + \frac12} - \frac{1}{n_2 + \frac12} } \] for its standard deviation are used (Gart & Nam 1988: 324).

ApproxLRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE, adjust=c("walter", "hardie")) {
  stopifnot(length(f1) == length(f2))
  stopifnot(all(f1 >= 1))
  adjust <- match.arg(adjust)
  
  alpha <- 1 - conf.level                  # desired significance level 
  if (correct) alpha <- alpha / length(f1) # Bonferroni correction

  ## approximate confidence interval based on asymptotic standard deviation of log(rr)
  if (adjust == "hardie") {
    f2.disc <- pmax(f2, 0.5) #  with Hardie's (2014) questionable discounting
    lrr <- log((f1 / N1) / (f2.disc / N2))
    lrr.sd <- sqrt(1/f1 + 1/f2.disc - 1/N1 - 1/N2) # asymptotic s.d. of log(RR) according to Wikipedia
  } 
  else {
    lrr <- log((f1 + 0.5) / (N1 + 0.5)) - log((f2 + 0.5) / (N2 + 0.5))
    lrr.sd <- sqrt(1/(f1 + 0.5) + 1/(f2 + 0.5) - 1/(N1 + 0.5) - 1/(N2 + 0.5))
  }
    
  z.factor <- -qnorm(alpha / 2)                  # number of s.d. for asymptotic two-sided conf.int
  lrr <- ifelse(lrr >= 0, 
                pmax(lrr - z.factor * lrr.sd, 0), # log(RR) >= 0 -> use lower bound (clamped to >= 0)
                pmin(lrr + z.factor * lrr.sd, 0)) # log(RR) < 0  -> use upper bound (clamped to <= 0)

  lrr / log(2) # adjust to log2 units
}

3 Example from Evert et al. (2018)

As an illustrative example and for visualisations, we provide a data set of keyword candidates from the evaluation study of Evert et al. (2018) on discourses around multi-resistant pathogens (MRO) in Germany. In this study, an earlier version of LRC achieved slightly better evaluation results than LogRatio and log-likelihood, providing a balance between the low-frequency bias of the former and the high-frequency bias of the latter.

The tabular file sample_data.tsv contains document frequencies for 20,109 candidate terms in the target corpus of online articles on MRO (\(f_1\)) and in a newspaper reference corpus (\(f_2\)). The data set includes all terms with \(f_1 \geq 2\) in the target corpus. The sample sizes – i.e. the numbers of texts in the target (\(n_1\)) and reference (\(n_2\)) corpus – have to be provided separately.

Keywords <- read.delim("sample_data.tsv", quote="", stringsAsFactors=FALSE, fileEncoding="UTF-8")
n1 <- 1177   # target corpus (document count)
n2 <- 341805 # reference corpus

We can now apply the different keyness measures to the example data set. Note that the default discount \(\lambda = 1\) of SimpleMaths is designed for token frequencies per million words, but our relative document frequencies are often considerably higher. Assuming one occurrence per document, relative frequency should increase by a factor corresponding to the average document size (which is very roughly 500 tokens in this data set), so \(\lambda = 100\) seems a reasonable choice.

Keywords <- transform(
  Keywords,
  G2 = G2(f1, f2, n1, n2),
  LogRatioHardie = LogRatio(f1, f2, n1, n2, adjust="hardie"),
  LogRatio = LogRatio(f1, f2, n1, n2),
  LogRatio.G2 = LogRatio(f1, f2, n1, n2, alpha=.05),
  SM = SimpleMaths(f1, f2, n1, n2, lambda=100),
  LRC = LRC(f1, f2, n1, n2),
  LRC.positive = PositiveLRC(f1, f2, n1, n2),
  LRC.hardie = ApproxLRC(f1, f2, n1, n2, adjust="hardie"),
  LRC.normal = ApproxLRC(f1, f2, n1, n2))

Among the 20,109 candidates there are 6,143 significant positive keywords according to a \(G^2\) test with Bonferroni correction, and 4,722 according to LRC.

To gain a first impression of the keyness measures, we can look at the top-20 keywords for each measure. Note that this superficial impression does not accurately reflect the usefulness of the measures for an actual corpus study that will take a much larger number of top-ranked candidates into account. The show.top.kw() function always shows LRC scores and corresponding ranks for comparison.

show.top.kw <- function (df, col, n=20) {
  idx <- order(df[[col]], decreasing=TRUE)[seq_len(min(n, nrow(df)))]
  top <- df[idx, c("lemma", "f1", "f2", col)]
  top$rank <- seq_along(idx)
  if (col != "LRC") {
    df$r.LRC <- rank(-df$LRC, ties.method="first")
    top <- cbind(top, df[idx, c("LRC", "r.LRC")])
  }
  top
}
show.top.kw(Keywords, "G2")
##                    lemma  f1   f2        G2 rank       LRC r.LRC
## 663         Antibiotikum 944  367 10290.405    1  9.132239    59
## 5935                Keim 936  805  9325.546    2  8.071781   143
## 17607          resistent 778  244  8397.957    3  9.365925    48
## 5466           Infektion 780  699  7503.595    4  7.984664   152
## 1155            Bakterie 748  528  7388.316    5  8.298289   121
## 16949     multiresistent 624   70  7039.846    6 10.530982    11
## 13204              ander 574   18  6693.887    7 11.757507     1
## 3233             Erreger 665  473  6480.059    8  8.264729   123
## 8613             Patient 819 5370  5253.111    9  5.207546   700
## 9451           Resistenz 482  130  5069.018   10  9.420591    44
## 7185                MRSA 411   13  4711.916   11 11.530551     2
## 6425         Krankenhaus 733 7074  4073.472   12  4.640064   886
## 5170             Hygiene 420  403  3799.236   13  7.763843   177
## 1207           Bakterium 342  117  3471.324   14  9.019495    64
## 6138              Klinik 575 4169  3378.359   15  5.012148   803
## 16194         infizieren 392  754  3127.711   16  6.805293   326
## 6467     Krankenhauskeim 269   17  2990.714   17 10.672729     7
## 10645     Staphylococcus 258   11  2897.350   18 10.961830     4
## 13556             aureus 257   11  2885.592   19 10.955959     5
## 6462  Krankenhaushygiene 260   28  2829.297   20 10.164283    18
show.top.kw(Keywords, "LogRatio")
##                       lemma f1 f2 LogRatio rank       LRC r.LRC
## 2274               Colistin 86  0 15.61593    1 10.727321     6
## 19998                  über 79  0 15.49419    2 10.594453     9
## 7869        Multiresistente 78  0 15.47593    3 10.574436    10
## 14416                 desto 72  0 15.36121    4 10.448163    12
## 11697                   VRE 60  0 15.10017    5 10.156939    20
## 618   Antibiotika-resistent 55  0 14.97572    6 10.015960    23
## 16670                lieber 51  0 14.86781    7  9.892440    26
## 14183                   bis 43  0 14.62425    8  9.608762    34
## 17174                    of 42  0 14.59070    9  9.569094    36
## 10882         Superbakterie 41  0 14.55634   10  9.528322    37
## 2332                   DGKH 38  0 14.44809   11  9.398728    46
## 16811               meisten 35  0 14.33105   12  9.256585    49
## 4573                 Grünen 34  0 14.28983   13  9.205982    51
## 10883        Superbakterium 34  0 14.28983   14  9.205982    52
## 2749           ESBL-bildend 33  0 14.24739   15  9.153581    54
## 11711  Vancomycin-resistent 33  0 14.24739   16  9.153581    55
## 7202              MRSA-Fall 31  0 14.15858   17  9.042852    63
## 10532         Spitalhygiene 30  0 14.11204   18  8.984218    68
## 10678           Stattdessen 30  0 14.11204   19  8.984218    69
## 7212           MRSA-Patient 29  0 14.06395   20  8.923170    75
show.top.kw(Keywords, "LogRatio.G2")
##                       lemma f1 f2 LogRatio.G2 rank       LRC r.LRC
## 2274               Colistin 86  0    15.61593    1 10.727321     6
## 19998                  über 79  0    15.49419    2 10.594453     9
## 7869        Multiresistente 78  0    15.47593    3 10.574436    10
## 14416                 desto 72  0    15.36121    4 10.448163    12
## 11697                   VRE 60  0    15.10017    5 10.156939    20
## 618   Antibiotika-resistent 55  0    14.97572    6 10.015960    23
## 16670                lieber 51  0    14.86781    7  9.892440    26
## 14183                   bis 43  0    14.62425    8  9.608762    34
## 17174                    of 42  0    14.59070    9  9.569094    36
## 10882         Superbakterie 41  0    14.55634   10  9.528322    37
## 2332                   DGKH 38  0    14.44809   11  9.398728    46
## 16811               meisten 35  0    14.33105   12  9.256585    49
## 4573                 Grünen 34  0    14.28983   13  9.205982    51
## 10883        Superbakterium 34  0    14.28983   14  9.205982    52
## 2749           ESBL-bildend 33  0    14.24739   15  9.153581    54
## 11711  Vancomycin-resistent 33  0    14.24739   16  9.153581    55
## 7202              MRSA-Fall 31  0    14.15858   17  9.042852    63
## 10532         Spitalhygiene 30  0    14.11204   18  8.984218    68
## 10678           Stattdessen 30  0    14.11204   19  8.984218    69
## 7212           MRSA-Patient 29  0    14.06395   20  8.923170    75
show.top.kw(Keywords, "SM")
##                       lemma  f1  f2        SM rank       LRC r.LRC
## 13204                 ander 574  18 3195.1750    1 11.757507     1
## 7185                   MRSA 411  13 2530.4956    2 11.530551     2
## 16949        multiresistent 624  70 1739.7306    3 10.530982    11
## 10645        Staphylococcus 258  11 1659.0853    4 10.961830     4
## 13556                aureus 257  11 1652.6576    5 10.955959     5
## 6467        Krankenhauskeim 269  17 1527.0023    6 10.672729     7
## 9440    Reserveantibiotikum 162   3 1266.2433    7 11.041688     3
## 6462     Krankenhaushygiene 260  28 1214.8360    8 10.164283    18
## 645    Antibiotikaresistenz 180  15 1063.5683    9 10.163151    19
## 7207              MRSA-Keim 134   4 1020.1087   10 10.610060     8
## 2736                   ESBL 118   4  898.4115   11 10.412481    14
## 1189         Bakterienstamm 143  14  862.6289   12  9.861341    27
## 9451              Resistenz 482 130  852.7730   13  9.420591    44
## 12610         Wundinfektion 141  16  816.6735   14  9.725766    32
## 17607             resistent 778 244  812.3077   15  9.365925    48
## 211           Acinetobacter 112   6  810.3279   16 10.096282    21
## 596   Antibiotika-Resistenz 117   8  806.3299   17  9.973964    25
## 13373  antibiotikaresistent 132  14  796.3274   18  9.734820    31
## 17344            pneumoniae  98   2  787.5438   19 10.431809    13
## 8669             Penicillin 152  23  772.5627   20  9.516115    40
show.top.kw(Keywords, "LRC")
##                      lemma  f1 f2      LRC rank
## 13204                ander 574 18 11.75751    1
## 7185                  MRSA 411 13 11.53055    2
## 9440   Reserveantibiotikum 162  3 11.04169    3
## 10645       Staphylococcus 258 11 10.96183    4
## 13556               aureus 257 11 10.95596    5
## 2274              Colistin  86  0 10.72732    6
## 6467       Krankenhauskeim 269 17 10.67273    7
## 7207             MRSA-Keim 134  4 10.61006    8
## 19998                 über  79  0 10.59445    9
## 7869       Multiresistente  78  0 10.57444   10
## 16949       multiresistent 624 70 10.53098   11
## 14416                desto  72  0 10.44816   12
## 17344           pneumoniae  98  2 10.43181   13
## 2736                  ESBL 118  4 10.41248   14
## 9464  Resistenzentwicklung  78  1 10.28177   15
## 15752          gramnegativ  78  1 10.28177   16
## 2113           Carbapeneme  89  2 10.28002   17
## 6462    Krankenhaushygiene 260 28 10.16428   18
## 645   Antibiotikaresistenz 180 15 10.16315   19
## 11697                  VRE  60  0 10.15694   20

We also compute rankings for all keyness measures (but in a new data frame Rankings to keep the original free of clutter). We omit LogRatio.G2 because its ranks are identical to those of LogRatio for all significant positive keywords; the same holds for LRC.positive.

my.rank <- function (x) rank(-x, ties.method="first")
Rankings <- transform(
  Keywords,
  r.G2 = my.rank(G2),
  r.LogRatioHardie = my.rank(LogRatioHardie),
  r.LogRatio = my.rank(LogRatio),
  r.SM = my.rank(SM),
  r.LRC = my.rank(LRC),
  r.LRC.hardie = my.rank(LRC.hardie),
  r.LRC.normal = my.rank(LRC.normal))

3.1 Overlap and frequency distribution

As a further indication of how the keyness measures differ, we compare lists of top-250 keywords from each measure. The overlap between such lists can be visualised with Venn diagrams:

tmp <- with(Rankings, data.frame(
  G2 = r.G2 <= 250, LogRatio = r.LogRatio <= 250, SM = r.SM <= 250,
  LRC = r.LRC <= 250, LRC.norm = r.LRC.normal <= 250, LRC.hardie = r.LRC.hardie <= 250))

tmp1 <- subset(tmp, G2 | LogRatio | SM | LRC, qw("G2 SM LRC LogRatio"))
venn(tmp1, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)

tmp2 <- subset(tmp, LRC | LRC.hardie | LRC.norm, qw("LRC LRC.norm LRC.hardie"))
venn(tmp2, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
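The overlap counts behind such Venn diagrams can also be read off numerically. A minimal self-contained sketch with a hypothetical membership table (the real analysis would use the logical columns of tmp above): crossprod() on the logical matrix yields all pairwise overlaps at once.

```r
## toy membership table (hypothetical): three measures, five candidate terms;
## TRUE = the term is in that measure's top-n list
tops <- data.frame(
  M1 = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  M2 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  M3 = c(TRUE, TRUE, TRUE, FALSE, FALSE))
ov <- crossprod(as.matrix(tops))  # logicals are coerced to 0/1
ov  # diagonal = size of each list; off-diagonal = pairwise overlaps
```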

if (export.plots) {
  cairo_pdf(file="img/venn1_top250.pdf", width=8, height=8)
  venn(tmp1, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
  invisible(dev.off())
  
  cairo_pdf(file="img/venn2_top250.pdf", width=8, height=8)
  venn(tmp2, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
  invisible(dev.off())
}

A histogram of \(f_1\) for the top-250 candidate sets reveals that the limited overlap is primarily explained by differences in the frequency profiles of the measures. Note the logarithmic scale on the \(f_1\)-axis. LRC has the most balanced frequency distribution, covering a plausible range of mid-frequency terms while excluding overly general high-frequency terms.

tmp <- melt(Rankings, id=qw("lemma f1 f2"), measure=qw("r.G2 r.SM r.LRC r.LogRatio"), 
            variable.name="measure", value.name="rank")
tmp <- subset(tmp, rank <= 250)
hist.pal <- structure(muted.pal[1:4], names=qw("r.G2 r.SM r.LRC r.LogRatio"))

plt <- ggplot(tmp, aes(x=f1, after_stat(count), fill=measure)) + 
  scale_x_log10(limits=c(4, 2e3)) + 
  geom_density(alpha=.4) + 
  scale_fill_manual(values=hist.pal, labels=expression(G^2, SM, LRC, LogRatio)) + 
  xlab("frequency in target corpus") + ylab("density of keywords") +
  theme_grey(base_size=15) + theme(legend.position="top")
print(plt)

if (export.plots) {
  cairo_pdf(file="img/histogram_f1_top250.pdf", width=8, height=8)
  print(plt)
  invisible(dev.off())
}
mk.plt <- function (x) {
  ggplot(x, aes(x=f1, after_stat(count), fill=measure)) + 
  scale_x_log10(limits=c(4, 2e3)) + 
  geom_density(alpha=.4) + 
  scale_fill_manual(values=hist.pal, labels=expression(G^2, SM, LRC, LogRatio)) + 
  xlab("frequency in target corpus") + ylab("density of keywords") +
  theme_grey(base_size=15) + theme(legend.position="top")
}

if (export.plots) {
  cairo_pdf(file="img/histogram_f1_top250_G2.pdf", width=12, height=4)
  print(mk.plt(subset(tmp, measure %in% qw("r.G2"))))
  invisible(dev.off())
  cairo_pdf(file="img/histogram_f1_top250_G2_LR.pdf", width=12, height=4)
  print(mk.plt(subset(tmp, measure %in% qw("r.G2 r.LogRatio"))))
  invisible(dev.off())
  cairo_pdf(file="img/histogram_f1_top250_G2_LR_LRC.pdf", width=12, height=4)
  print(mk.plt(subset(tmp, measure %in% qw("r.G2 r.LogRatio r.LRC"))))
  invisible(dev.off())
}

4 Topographic maps of keyness measures

For a given pair of samples A and B, the sample sizes \(n_1\) and \(n_2\) are fixed for all candidates in the data set, so keyness scores depend only on the pair of observed frequencies \((f_1, f_2)\). Interpreting these pairs as coordinates in a two-dimensional plane, keyness measures can be visualised as topographic maps over this plane. Such plots give further insight into the different mathematical properties of the keyness measures.

The function topographic.map(LogRatio, ...) visualises the keyness measure implemented by LogRatio(), with detailed control over the frequency ranges to be plotted and many other graphical aspects. If z.log=TRUE then keyness scores are visualised on a logarithmic scale (supporting both positive and negative scores). A data frame with candidate terms and their frequencies can be passed in the optional argument data= and be overlaid as a scatterplot.

signed.log <- function (x) sign(x) * log10(abs(x) + 1)  # signed log transform, maps 0 to 0
signed.exp <- function (x) sign(x) * (10^abs(x) - 1)    # inverse of signed.log
topographic.map <- function (MEASURE, xlim=c(0, 1e5), ylim=c(1e2, 1e6), zlim=NULL,
                             N1=10e6, N2=10e6, grid=400, nlevels=20, z.log=FALSE, main="", 
                             data=NULL, pch=20, cex=.5, col="#00552288", jitter=.05, ...) {
  ## lay out an evenly spaced grid in signed log space, then map back to frequencies
  xlim.log <- signed.log(xlim)
  ylim.log <- signed.log(ylim)
  x.log <- seq(xlim.log[1], xlim.log[2], length.out=grid)
  y.log <- seq(ylim.log[1], ylim.log[2], length.out=grid)
  x <- signed.exp(x.log)
  y <- signed.exp(y.log)
  z.mat <- outer(y, x, MEASURE, N1=N1, N2=N2, ...) # evaluate keyness measure on grid
  if (z.log) z.mat <- signed.log(z.mat)            # optionally show scores on signed log scale
  ## diverging palette from blue (negative) through white (zero) to red (positive)
  col.range <- c(hsv(h=2/3, v=1, s=(100:1)/100), rgb(1,1,1), hsv(h=0, v=1, s=(1:100)/100)) 
  if (is.null(zlim)) zlim <- c(-1,1) * max(abs(z.mat)) # symmetric range so 0 maps to white
  image(x.log, y.log, t(z.mat), xlim=xlim.log, ylim=ylim.log, zlim=zlim,
        col=col.range, xlab=expression(log[10](f[2]+1)), ylab=expression(log[10](f[1]+1)), main=main)
  contour(x.log, y.log, t(z.mat), nlevels=nlevels, labcex=1, add=TRUE)
  if (!is.null(data)) {
    set.seed(42) # reproducible jitter for the scatterplot overlay
    points(jitter(signed.log(data$f2), amount=jitter), jitter(signed.log(data$f1), amount=jitter),
           pch=pch, cex=cex, col=col)
  }
}
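As a quick sanity check for the grid construction (base R only, restating the two helpers so the snippet is self-contained): signed.log() and signed.exp() are exact inverses, which is what lets topographic.map() lay out an evenly spaced grid in signed log space and map it back to frequencies.

```r
signed.log <- function (x) sign(x) * log10(abs(x) + 1)
signed.exp <- function (x) sign(x) * (10^abs(x) - 1)
x <- c(-1000, -1, 0, 1, 250, 1e5)
all.equal(signed.exp(signed.log(x)), x)  # TRUE: the round trip recovers the input
signed.exp(seq(0, 3, length.out=4))      # 0 9 99 999: even steps in log space
```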

4.1 Three typical settings

4.1.1 Comparison of two large corpora

\(n_1 = n_2 = 100 \text{M}\) tokens, assuming that we are testing \(m = 100 \text{k}\) candidate terms (for Bonferroni correction). Note that we have to specify the adjusted significance level \(\alpha'\) (equivalently, the confidence level \(1 - \alpha'\)) manually because there is no underlying data set from which the number of tests could be inferred.

conf.adjust <- 1 - .05 / 1e5 # Bonferroni-adjusted confidence level of 95% for m=100,000 candidates
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=100e6, N2=100e6, ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")

my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")

my.topomap(LogRatio, main="LogRatio")

my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")

my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")

my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")

my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")

my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")

my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")

my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")

my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")

if (export.plots) {
  cairo_pdf(file="img/topomap_100M_100M.pdf", width=12, height=8)
  par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
  my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
  my.topomap(LogRatio, main="LogRatio")
  my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
  my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
  my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
  my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
  invisible(dev.off())
}

4.1.2 Comparison of two small corpora

\(n_1 = n_2 = 1 \text{M}\) tokens, assuming that we’re testing \(m = 5 \text{k}\) candidate terms.

conf.adjust <- 1 - .01 / 5e3
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=1e6, N2=1e6, xlim=c(0, 1e5), ylim=c(5, 1e5), ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")

my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")

my.topomap(LogRatio, main="LogRatio")

my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")

my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")

my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")

my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")

my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")

my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")

my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")

my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")

if (export.plots) {
  cairo_pdf(file="img/topomap_1M_1M.pdf", width=12, height=8)
  par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
  my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
  my.topomap(LogRatio, main="LogRatio")
  my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
  my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
  my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
  my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
  invisible(dev.off())
}

4.1.3 Comparison of small target corpus against large reference

\(n_1 = 1 \text{M}\) vs. \(n_2 = 100 \text{M}\) tokens, assuming that we’re testing \(m = 5 \text{k}\) candidate terms.

conf.adjust <- 1 - .01 / 5e3
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=1e6, N2=100e6, xlim=c(0, 1e6), ylim=c(5, 1e5), ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")

my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")

my.topomap(LogRatio, main="LogRatio")

my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")

my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")

my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")

my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")

my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")

my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")

my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")

my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")

if (export.plots) {
  cairo_pdf(file="img/topomap_1M_100M.pdf", width=12, height=8)
  par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
  my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
  my.topomap(LogRatio, main="LogRatio")
  my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
  my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
  my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
  my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
  invisible(dev.off())
}

4.2 Visualisation of the example data set

Finally, we visualise the realistic setting of the example data set based on document frequencies. As a point of comparison, we show the distribution of candidate terms that belong to any of the top-250 lists compared above.

tmp <- Rankings[, grepl("^r[.]", colnames(Rankings), perl=TRUE)]
idx <- apply(tmp, 1, min) <= 250
ShowCand <- Rankings[idx, qw("lemma f1 f2")]
conf.adjust <- 1 - .01 / nrow(Rankings)
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=n1, N2=n2, xlim=c(0, 100e3), ylim=c(1, 1000), data=ShowCand, cex=1, col="#006622", ...)
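The row-wise selection idiom above (keep a candidate if any of its ranks is at most 250) can be seen in a small toy example with hypothetical rank columns:

```r
## hypothetical rank columns for four candidates under two measures
Rk <- data.frame(r.A = c(10, 300, 251, 40), r.B = c(500, 120, 400, 30))
idx <- apply(Rk, 1, min) <= 250  # TRUE if the candidate makes *any* top-250 list
idx  # TRUE TRUE FALSE TRUE
```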
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")

my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")

my.topomap(LogRatio, main="LogRatio")

my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")

my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")

my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")

my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")

my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")

my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")

my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")

my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")

if (export.plots) {
  cairo_pdf(file="img/topomap_Keywords.pdf", width=12, height=8)
  par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
  my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
  my.topomap(LogRatio, main="LogRatio")
  my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
  my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
  my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
  my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
  invisible(dev.off())
}

5 References

  • Egbert, J. and Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1): 77–104.
  • Evert, S. (2008). Corpora and collocations. In Lüdeling, A. and Kytö, M. (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
  • Evert, S., Dykes, N., and Peters, J. (2018). A quantitative evaluation of keyword measures for corpus-based discourse analysis. Presentation at the Corpora & Discourse International Conference (CAD 2018), Lancaster, UK.
  • Fay, M. P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1): 53–58.
  • Gabrielatos, C. and Marchi, A. (2012). Keyness: Appropriate metrics and practical issues. Presentation at the Corpora and Discourse Studies Conference (CADS 2012), Bologna, Italy.
  • Gart, J. J. and Nam, J. (1988). Approximate interval estimation of the ratio of two binomial parameters: A review and corrections for skewness. Biometrics, 44(2): 323–338.
  • Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1): 137–165.
  • Hardie, A. (2012). CQPweb – combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3): 380–409.
  • Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.
  • Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.
  • Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, New York, 3rd edition.
  • Pojanapunya, P. and Watson Todd, R. (2018). Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 14(1): 133–167.
  • Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1–6, Hong Kong.
  • Scott, M. (1997). PC analysis of key words – and key key words. System, 25(2): 233–245.
  • Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2): 371–374.